Searching for right-sized storage: our experience and incidents


Shinji KONO, Associate Professor, University of the Ryukyus


Storage choice for our faculty

  300 students (undergraduate or graduate)
  20 staff
  study areas
     Hardware such as FPGAs and CPUs
     A.I. such as neural networks, image recognition, and learning
     Networking such as the Internet and Wi-Fi
     Software such as operating systems and programming languages


Every student has their own notebook PC (MacBook)

    We have been using OS X since 2002 (earlier than Tokyo University)
    Parents always ask us why
    The only commercially supported Unix for end users
    BYOD since 2002


Our system is updated every 5 years

    1997 : Sun Enterprise 3000 x 2 + NetApp
    2001 : Cheap PC (MiNT PC) x 2 + Newtech RAID x 2
    2005 : HP DL380 x 2 + HP DL380 RAID + Apple Xserve / Xserve RAID
    2006 : 1U CoreDuo x 180 cluster
    2010 : Fujitsu Blade Server with SAN / VMware
    2015 : Dell PowerEdge R630 x 4 with SAN / KVM + GFS2
       Sakura cloud
    2020 : Dell PowerEdge R740 x 4 + R740xd x 2 (32GB / 48TB)
       AWS educate / Sakura cloud


Requirements

  No critical processing (for student study only)
  Maintained by a group of students with supporting staff
  We want to know the internals


Possible choice / Evaluation

  VMWare
  Datrium
  Pure Storage
  Nutanix


VMware

* very expensive *

It is essentially just Solaris with a SAN

Very flexible, with various templates for students and researchers


GFS2 + KVM

open source

A Linux PC cluster with GFS2 is very difficult to handle

   the lock manager (DLM) is a single point of failure
   it is very easy to stop the whole cluster
   migration is easy on GFS2 / SAN (see the sketch below)
   it is very slow
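
Because every node sees the same GFS2 volume over the SAN, a live migration only copies guest memory. A minimal sketch with libvirt/KVM, assuming a guest named vm01 and a destination host dest01 (both names are hypothetical):

   # confirm the guest's disk image sits on the shared GFS2 mount
   virsh domblklist vm01

   # live-migrate the running guest; the disk stays in place on the shared volume
   virsh migrate --live --persistent vm01 qemu+ssh://dest01/system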


Pure Storage

FPGA-based compression storage

SAN

HCI

choice


HCI (Hyper-Converged Infrastructure)

SDS: Software-Defined Storage


Datrium / Nutanix

So-called HCI. The internals are hidden in their implementations.

We cannot log in to the hypervisor of a Nutanix system.


Our requirements

Educational purpose

  We want to see the internals
  Open source
  Maintenance by a group of students


Current System

No iSCSI network

HDD node x 2 + SSD node x 4 with GPU

Ceph

Sakura Rental server


Ceph

Distributed Object Storage

   OSD        Object Storage Daemon
   MDS        Metadata Server
   MON        Monitor / lock / queue
   cephadm    container-based package / administration tool

All Ceph commands have to be executed in a cephadm container
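
For example, on a cephadm deployment the usual workflow is to open the management container first; a minimal sketch using only the standard commands:

   # open a shell in the cephadm management container on any cluster host
   cephadm shell

   # inside the container, the ordinary ceph CLI is available
   ceph -s           # overall cluster health
   ceph osd tree     # OSD layout
   exit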


Ceph requirements

No more than one OSD per HDD/SSD

Each OSD requires 4GB of buffer memory (what?!)

At least 3 MONs are required

We should not run MDS/MON on an OSD node

So...

  Ceph requires rather expensive resources
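
The 4GB figure is the default osd_memory_target; a rough sketch of checking it and placing three MONs away from the OSD nodes (host names host1..host3 are hypothetical; lowering the target is only for a memory-poor test cluster):

   # show the default OSD memory target (4GiB = 4294967296 bytes)
   ceph config get osd osd_memory_target

   # lower it on a memory-poor test cluster (not recommended in general)
   ceph config set osd osd_memory_target 2147483648

   # keep three monitors on hosts separate from the OSD nodes
   ceph orch apply mon --placement="host1 host2 host3"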


NetApp trouble in 1998

A bad configuration setting stopped everything.

An international telephone call in the middle of the night.

   It has snapshots including the system configuration (very good)

Vendor-supported storage is very good.


RAID trouble

RAID technology is not a backup

Rebuild procedure

Bad things may happen during a rebuild (as always)
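
Our arrays were hardware RAID with vendor tools, but as a rough illustration with Linux software RAID (mdadm; the device name /dev/md0 is hypothetical), you would watch the rebuild and scrub regularly so a rebuild never has to trust an unread sector:

   # watch rebuild progress
   cat /proc/mdstat
   mdadm --detail /dev/md0 | grep -E 'State|Rebuild'

   # periodic scrub: find latent sector errors before a rebuild needs every disk
   echo check > /sys/block/md0/md/sync_action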


GFS2 trouble

Linux cluster with Corosync

    DLM is the single point of failure

Very easy to stop
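
When everything stops, the first things we look at are quorum and the DLM lockspaces; a minimal sketch on one cluster node (assuming the standard corosync/dlm tools, plus pcs if Pacemaker is used):

   # quorum state as seen by corosync
   corosync-quorumtool -s

   # DLM lockspaces currently held on this node
   dlm_tool ls

   # overall resource view, if Pacemaker/pcs is installed
   pcs status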


Recovery

GFS2 is on LVM (possibly software RAIDed)

Read the Volume Group configuration in the system area (QNAP requires the same kind of trick)

Remove the cluster flag in the VG header

   status = ["RESIZEABLE", "READ", "WRITE", "CLUSTERED"]

Remove the SCSI reservation

   sg_persist --out --no-inquiry --clear --param-rk=0xb8c90000 --device=/dev/sdb

Mount without the lock manager

   mount -o lockproto=lock_nolock /dev/mapper/vg_whisky-lv_whisky /mnt/whisky/
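
The cluster flag can also be dropped with the LVM tools instead of editing the metadata by hand; a rough sketch, assuming the same vg_whisky volume group and the LVM2 locking_type override that was available at the time:

   # back up the current VG metadata first
   vgcfgbackup vg_whisky

   # clear the CLUSTERED flag without a running DLM
   vgchange -cn vg_whisky --config 'global { locking_type = 0 }'

   # activate the volumes, then mount with lock_nolock as above
   vgchange -ay vg_whisky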


Ceph trouble

Ceph stores everything in the OSDs, but ...

Without the MDS (metadata), there is no way to access the contents.

If we change the IP addresses, the linkage between MDS and OSD is lost, that is, we lose everything.

So basically,

   we cannot change the IP addresses of the OSD/MDS/MON nodes
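
Changing a monitor address is technically possible by rewriting the monmap offline, which is exactly why we treat the addresses as fixed; a rough sketch of the documented procedure (the monitor id a and the new address are hypothetical, and the monitor must be stopped first):

   # extract the current monitor map
   ceph-mon -i a --extract-monmap /tmp/monmap

   # replace the old entry with the new address
   monmaptool --rm a /tmp/monmap
   monmaptool --add a 10.0.0.11:6789 /tmp/monmap

   # inject the edited map back and restart the monitor
   ceph-mon -i a --inject-monmap /tmp/monmap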


Recovery

   There are possible ways to recover data from the OSDs, but it is not so easy (a sketch follows below)

But we have

a pair of cheap NetGear RAID boxes, which helped us with the recovery.
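
For completeness, the offline route goes through ceph-objectstore-tool against each OSD's data directory; a rough sketch (the OSD id 0 and PG id 2.1f are hypothetical, and the OSD daemon must be stopped first):

   # list the placement groups and objects held by this OSD
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list

   # export one placement group so it can be imported on another OSD
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 2.1f --op export --file /backup/pg2.1f.export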


So which is better, GFS2 or Ceph?

GFS2 relies on RAID / SAN technology
   direct writes through iSCSI

Ceph is based on Erasure Coding
   large memory buffer
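
For reference, erasure coding in Ceph is configured per pool from a profile; a minimal sketch for a small cluster like ours (the profile name ec21 and pool name ecpool are hypothetical; k=2, m=1 tolerates one failure):

   # define a 2+1 erasure-code profile and create a pool on it
   ceph osd erasure-code-profile set ec21 k=2 m=1 crush-failure-domain=host
   ceph osd pool create ecpool erasure ec21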

Ceph cannot yet handle...
   heterogeneous configurations (NVMe/SSD/HDD)
   dedup


Thank you!


Shinji KONO, Associate Professor, University of the Ryukyus / Fri Oct 23 14:50:56 2020